DETECTING SPAM EMAILS USING MACHINE LEARNING ALGORITHMS
Introduction
Spam is an unsolicited and unwanted message sent electronically by a sender who has no existing relationship with the recipient (Cormack, 2006). Electronic spam falls into several categories according to the channel through which it is transmitted, including email, SMS, social networks, and online commerce platforms. Email spam not only wastes users' time, as they have to identify and delete unwanted messages, but also occupies valuable mailbox storage and buries important personal emails (Zhang et al., 2004). SMS spam, by contrast, is sent over a mobile network (Delany et al., 2012). Social network spam has recently attracted growing attention from scholars and practitioners because of the large number of spammers and the adverse effect that such spam can have on the convenience and comprehension of all followers (Zhou et al., 2014). Online reviews are a further target of spam. According to a meta-analysis of more than 20 empirical studies conducted by Floyd et al. (2014), both the number and the valence of reviews are important determinants of retail sales. This applies especially to high-involvement products that can only be evaluated after they have been consumed, for which other consumers' experience of using the product is crucial. According to a survey conducted by BrightLocal in 2018, over 80% of consumers trust online reviews as much as personal recommendations. Spam filtering has therefore received significant attention in all of the aforementioned communication channels.
Spam messages can be filtered either manually or automatically. Manual filtering, in which spam messages are recognised and removed by hand, is laborious and time-consuming. Moreover, spam messages can pose a security risk by including links to phishing websites or servers that host malware. Consequently, researchers and practitioners have worked for several decades on improving automatic spam filtering systems. Machine learning algorithms are known to achieve high accuracy in identifying spam messages. Their basic principle is to construct a vocabulary and assign a weight to each word. To reduce the likelihood of being detected, spammers often pad their spam messages with content typical of legitimate messages. Several machine learning techniques are commonly used for spam filtering, including neural networks (NNs) (Barushka and Hajek, 2016), support vector machines (SVMs) (Bhowmick and Hazarika, 2018), Naïve Bayes (NB) (Almeida et al., 2011), and random forest (RF) (Choudhary and Jain, 2017).
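The vocabulary-and-weights principle can be illustrated with a minimal sketch. It assumes a smoothed log-odds weighting in the spirit of Naïve Bayes; the toy corpus, the function names, and the smoothing constant `alpha` are hypothetical and are not taken from the cited studies.

```python
import math
from collections import Counter

# Hypothetical toy corpus; messages and labels are illustrative only.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free offer click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the agenda", "ham"),
]


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def build_weights(samples, alpha=1.0):
    """Build a vocabulary and return {word: log-odds weight}.

    Positive weights point towards spam, negative weights towards ham.
    `alpha` is an assumed Laplace smoothing constant.
    """
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in samples:
        counts[label].update(tokenize(text))
    vocab = set(counts["spam"]) | set(counts["ham"])
    totals = {c: sum(counts[c].values()) + alpha * len(vocab) for c in counts}
    return {
        w: math.log((counts["spam"][w] + alpha) / totals["spam"])
        - math.log((counts["ham"][w] + alpha) / totals["ham"])
        for w in vocab
    }


def spam_score(text, weights):
    """Sum the weights of known words; unknown words contribute nothing."""
    return sum(weights.get(w, 0.0) for w in tokenize(text))


WEIGHTS = build_weights(TRAIN)
plain = spam_score("free prize now", WEIGHTS)
padded = spam_score("free prize now agenda meeting monday", WEIGHTS)
```

In this sketch, `padded` is lower than `plain`, which shows why padding a spam message with legitimate-looking words makes it harder for a purely word-weighted filter to detect.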
The survey conducted by Kaur et al. (2018) found that ensemble learning techniques, such as bagging and random forest, outperform traditional single classifiers. Ensemble methods combine the predictions of multiple base machine learning algorithms to achieve higher accuracy and robustness than any individual algorithm. Prior research has successfully applied ensemble approaches to conventional classifiers, such as decision trees, in order to filter out spam messages. Surprisingly, little attention has been paid to neural networks (NNs) combined with ensemble learning. Recent evidence suggests that neural networks equipped with regularisation techniques can achieve high accuracy in detecting spam in emails and SMS messages (Barushka and Hajek, 2016). This can be attributed to improved optimisation convergence and robustness against overfitting. To exploit these properties, this doctoral thesis combines regularised neural networks with ensemble learning techniques for automated spam filtering. To improve the performance of the proposed technique, rectified linear units and dropout regularisation are employed in deep feed-forward neural networks (DFFNNs). This is done to overcome the problem of optimisation converging to a suboptimal solution, which is often encountered in typical shallow neural network models.
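To make the two ingredients named above concrete, the following sketch shows one hidden layer of a feed-forward network with a rectified linear unit followed by dropout. It is only an illustration under stated assumptions: the "inverted" dropout variant, the function names, and the layer sizes are hypothetical choices, not the configuration used in this thesis.

```python
import random


def relu(values):
    """Rectified linear unit: max(0, x) applied element-wise."""
    return [max(0.0, v) for v in values]


def dropout(values, rate, rng, training=True):
    """Inverted dropout (assumed variant, requires 0 <= rate < 1).

    During training each unit is zeroed with probability `rate` and the
    surviving units are rescaled by 1 / (1 - rate). At inference time the
    input is returned unchanged.
    """
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer with one weight row per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def hidden_layer(inputs, weights, biases, rate, rng, training=True):
    """Dense layer, then ReLU, then dropout."""
    return dropout(relu(dense(inputs, weights, biases)), rate, rng, training)


# Hypothetical 2-input, 2-unit layer.
rng = random.Random(0)
W = [[1.0, -1.0], [0.5, 0.5]]
b = [0.0, 1.0]
activations = hidden_layer([1.0, 2.0], W, b, rate=0.5, rng=rng, training=False)
```

Because dropout switches off a random subset of units at each training step, no single unit can be relied upon, which is the intuition behind its regularising effect; the rectified linear unit, in turn, does not saturate for positive inputs.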
The spam filtering task is typically formulated as a binary classification problem, in which each message is categorised as either spam or ham. In addition to achieving high accuracy, spam filtering algorithms must keep the false positive rate low, that is, the proportion of legitimate messages classified as spam. This is necessary to prevent legitimate messages from failing to reach their intended recipients. Furthermore, accuracy as a classification performance measure does not account for the different costs associated with type I and type II errors. On imbalanced spam datasets, relying on accuracy alone can produce misleading results, because the minority class, which typically represents spam messages, has much less influence on accuracy than the majority class of legitimate messages. Hence, it is essential to consider several performance metrics when assessing the effectiveness of spam filtering algorithms.
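The point about imbalanced data can be checked with a small worked example. The sketch below assumes spam is the positive class and uses a hypothetical dataset of 950 ham and 50 spam messages; the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, fp, tn, fn) with `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1  # legitimate message flagged as spam
        else:
            if truth == positive:
                fn += 1  # spam message let through
            else:
                tn += 1
    return tp, fp, tn, fn


def metrics(y_true, y_pred):
    """Accuracy, false positive rate, and recall on the spam class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical imbalanced dataset: 950 ham, 50 spam.
y_true = ["ham"] * 950 + ["spam"] * 50
# A trivial "filter" that labels every message as ham.
y_all_ham = ["ham"] * 1000

result = metrics(y_true, y_all_ham)
# accuracy is 0.95 and the false positive rate is 0.0,
# yet recall is 0.0: not a single spam message is caught.
```

A filter that does nothing therefore scores 95% accuracy on this dataset, which is why accuracy has to be read alongside the false positive rate and a measure of how much spam is actually detected.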
The main idea behind content-based machine learning models is to build a list of words or phrases and assign a weight to each one. This can be done with a bag-of-words approach or by categorising words according to their part of speech or psycholinguistic properties (Crawford et al., 2015). However, these features suffer from sparsity, which makes it difficult to capture the semantic representation of messages. To address this problem, Ren and Ji (2017) proposed a gated recurrent neural network model for detecting review spam. Their method used word embeddings generated by the CBOW (continuous bag-of-words) model (Mikolov et al., 2013; Le and Mikolov, 2014) to map words to vectors based on contextual information. In this way, global semantic information can be obtained, which alleviates the sparsity problem to some extent. This strategy was reported to be more effective than traditional bag-of-words or part-of-speech tagging methods (Lilleberg et al., 2015).
Building upon these findings, this doctoral thesis uses word embeddings to obtain the semantic representation of emails, SMS, social network messages, and online reviews. Word2Vec, developed by Mikolov et al. (2013) and further improved by Le and Mikolov (2014), is a widely used technique for generating word embeddings, that is, vector representations of words, from a text corpus. The Word2Vec word representation can be obtained with two different model architectures, namely CBOW and Skip-Gram. This doctoral thesis departs from previous literature by employing the Skip-Gram model, which makes better use of word context to produce a more general context than the CBOW model (Mikolov et al., 2013). To train the Skip-Gram model, I employ hierarchical softmax, a computationally efficient approximation of the full softmax. In the first stage, to improve detection accuracy, I combine the generated word embeddings with the bag-of-words representation. In the second stage, the spam filtering model is trained using an ensemble learning method. The base learners in this model are DFFNNs equipped with regularisation techniques and rectified linear units, which distinguish spam from legitimate messages.
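The first stage described above can be sketched in two small steps: generating the (centre word, context word) pairs on which a Skip-Gram model is trained, and joining a bag-of-words vector with an embedding-based vector. This is a sketch under assumptions rather than the procedure of this thesis: the window size, the averaging of word vectors, the concatenation, and the tiny hand-written embeddings are all hypothetical, and the hierarchical softmax training itself is omitted.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for every word within `window`
    positions of each centre word."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


def combined_features(tokens, vocab, embeddings, dim):
    """Concatenate bag-of-words counts with the mean word embedding.

    Concatenation and averaging are assumed here purely for illustration.
    Words without an embedding are skipped when averaging.
    """
    bow = [tokens.count(word) for word in vocab]
    known = [embeddings[t] for t in tokens if t in embeddings]
    if known:
        mean = [sum(vec[k] for vec in known) / len(known) for k in range(dim)]
    else:
        mean = [0.0] * dim
    return bow + mean


# Hypothetical message, vocabulary, and 2-dimensional embeddings.
message = ["claim", "your", "free", "prize"]
pairs = skipgram_pairs(message, window=1)

vocab = ["free", "prize"]
embeddings = {"free": [1.0, 0.0], "prize": [0.0, 1.0]}
features = combined_features(["free", "prize"], vocab, embeddings, dim=2)
```

Each pair asks the model to predict a context word from its centre word, so words that occur in similar contexts end up with similar vectors. The combined vector keeps the exact word counts while adding this denser semantic signal.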
This thesis aims to develop a new machine learning model based on DFFNN ensembles using a high-dimensional feature representation for spam filtering in diverse communication channels.
The remainder of this dissertation thesis is organized as follows. Chapter 1 reviews related work on filtering spam messages. Chapter 2 sets the objectives of this dissertation thesis. Chapter 3 introduces the proposed research methodology. Chapter 4 presents the datasets used for the experimental comparison and Chapter 5 introduces the strategies for data preprocessing and feature selection. Chapter 6 outlines the proposed spam filtering model and Chapter 7 briefly introduces the state-of-the-art models used for comparisons. Chapters 8 and 9 present the experimental settings and results, as well as a comparative analysis with the state-of-the-art methods used for spam filtering. Chapter 10 discusses the limitations and suggests possible future directions. Chapter 11 presents the theoretical and application contributions of this dissertation thesis and the last chapter concludes.